Day 10 - Sorting and reducing
– But in the Latin alphabet, Jehovah begins with an “I”.
Indiana Jones and the Last Crusade (1989)
Sorting data is essential in so many cases. We sort data to make visual browsing easier, to pick the top
entries according to some quantity, or to spot duplicates. Needless to say, Unix provides a specific
tool for sorting, which is unsurprisingly called sort.
So, we are hackers, we pose as the keepers of deep and powerful knowledge about computers, and
we go around sorting things with a command called sort? We definitely have to come up with
some command with an obscure and meaningless name! Oh, wait, there are grep and sed, but we
will discuss them in a later lesson.
For the time being, let’s spend some time with the self-explanatory sort. Since sorting is so simple
and straightforward, sort is a command with… only 30 possible options. Well, it turns out that simple
things are often not so simple! I will show you only a couple of these options, the ones I
happen to use the most, but remember that you can read the man page and discover new useful
ones.
For starters, let’s use the bare command:
$ cat examples.txt | sort
or, if you prefer:
$ sort examples.txt
Nice and simple. You might notice some oddities, though. Why the empty line at the beginning?
And why is “* TM Sony Pictures” listed after “Spider-Man []”? By default, sort follows the collation
rules of the current locale, which often give punctuation little or no weight, much like the so-called
dictionary order that, according to the man page (see the -d option), “considers only blanks and
alphanumeric characters”. This means that the “* ” part of the string is not considered, and the
position is given by the “T” letter of “TM”. The first line, in turn, is the empty line that was in the
second-to-last position in the original file: being empty, it compares lower than any non-empty line,
so it comes first. Digits come before letters in the ASCII code, and in the usual locales as well, which
is why “007” is positioned before “aardvark” in the sorted output.
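If you want to see how much the result depends on the locale, here is a minimal sketch. It assumes a small
stand-in file containing only the lines quoted above (the real examples.txt is longer), and the
variable names sample and bytewise are made up for this illustration. Setting LC_ALL=C asks sort for
the plain byte-value order, which is the same on every system.

```shell
# Hypothetical stand-in for examples.txt: just the lines quoted in the text,
# with the empty line in the second-to-last position.
sample=$(mktemp)
printf '%s\n' 'Spider-Man []' 'aardvark' '* TM Sony Pictures' '' '007' > "$sample"

# Order under whatever locale is currently active (the result varies by system).
sort "$sample"

# Plain byte-value (ASCII) order, independent of the locale settings.
bytewise=$(LC_ALL=C sort "$sample")
printf '%s\n' "$bytewise"

# Remove the temporary file.
rm -f "$sample"
```

In the byte-value order the empty line still comes first, but “* TM Sony Pictures” now follows it
immediately, because “*” comes before digits and letters in the ASCII code. For the same reason the
uppercase “S” of “Spider-Man []” is placed before the lowercase “a” of “aardvark”.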
When sorting numbers, however, the dictionary order gives results that are probably not what we
want